Abstract. Hirsch's h-index is perhaps the most popular citation-based measure of scientific excellence.
In 2013 G. Ionescu and B. Chopard proposed an agent-based model describing a process for generating
publications and citations in an abstract scientific community. Within such a framework, one may simulate
a scientist's activity and, by extension, investigate the whole community of researchers. Even though the
Ionescu and Chopard model predicts the h-index quite well, the authors provided a solution based solely
on simulations. In this paper, we complete their results with exact, analytic formulas. What is more, by
considering a simplified version of the Ionescu-Chopard model, we obtain a compact, easy to compute
formula for the h-index. The derived approximate and exact solutions are investigated on simulated and
real-world data sets.



1 Introduction 

Since the 1999 seminal paper by Barabási and Albert [1], many methods originally developed in statistical physics have been successfully applied to a wide range of problems from diverse domains. Scientometrics, an area concerned with the quantitative characteristics of science and scientific research, is one such domain. Recently, different authors have studied, among others, the long-term prediction of scientific success [2], the impact that an affiliation change has on a scientist's productivity [3], and the production and consumption of knowledge in physics [4,5]. Historically, however, the main efforts were focused on the study of the structure of citation networks [6,7,8,9] and the reproduction of their degree distributions [?]. Since the seminal work of de Solla Price [?] it has been known that citation networks arise due to the preferential attachment rule [1]. This process, well known in complex network analysis [?], has been studied from the point of view of citation networks [?], where it is also known as the rich-get-richer rule or the Matthew effect [?]. Different variations of the classical, linear preferential attachment (see [?] or Table 1 in [7]) have been considered, but to the best of our knowledge there is a lack of models in the literature which concern the h-index (except [17], which is described in Sec. 2).


^ Corresponding author; e-mail: zogala@ibspan.waw.pl. 


The h-index, proposed in 2005 by J.E. Hirsch [15], is the most popular citation-based measure of scientific excellence. Even though this data fusion tool was already studied in the 1940s (compare the notion of the Ky Fan metric [?] and also the Sugeno integral, see, e.g., [10]), it may be conceived as a turning point in the history of scientometrics. The idea behind the Hirsch index is to measure not only the overall quality of a scientist's output (most often expressed by the number of citations that each individual paper received), but also its size. Thus, it may be understood as a measure of both the productivity and the impact of a researcher (or an institution). More formally, let us assume that we are given a list $S = (S_1, \dots, S_n) \in \mathbb{N}_0^n$, where $S_i$ denotes the number of citations to the $i$-th paper. If $S_{(n)} \geq 1$, the Hirsch index is given by the formula:

\[ h\text{-index} = \max\{h = 1, \dots, n : S_{(n-h+1)} \geq h\}, \]

where $S_{(n-h+1)}$ denotes the $(n-h+1)$-th order statistic of $S$. Moreover, if $S_{(n)} = 0$, then $h$-index $= 0$. Intuitively, an author has his/her h-index equal to $H$ if $H$ of her/his $n$ papers have at least $H$ citations each, and the other $n - H$ papers have at most $H$ citations each.
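The definition above can be sketched in a few lines of Python. The function name `h_index` and the sample citation lists are illustrative assumptions, not taken from the paper; the code sorts the citation counts in nonincreasing order and returns the largest rank h whose paper still has at least h citations.

```python
def h_index(citations):
    """Return the Hirsch index of a list of nonnegative citation counts."""
    # Sort in nonincreasing order, so position r holds the r-th most cited paper.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` papers all have at least `rank` citations
        else:
            break         # counts only decrease from here on
    return h


if __name__ == "__main__":
    # Hypothetical citation vector, used only to exercise the function.
    print(h_index([10, 8, 5, 4, 3]))
```

An empty list or a list of uncited papers yields 0, in line with the convention stated above.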

A few papers have been devoted to the stochastic properties of the h-index in some simple probabilistic models, see [?]. Recently, Ionescu and Chopard [17] considered a publication-citation process in an abstract scientific community, described by a multi-agent model. Such a model consists of a scientist producing new papers, giving citations to already published papers (including his/her own), and receiving citations from the community. This bottom-up approach makes it possible to simulate a single scientist's activity as well as to investigate the whole community of researchers. What we found very inspiring is that Fig. 3 in [?] is a perfect illustration of the mechanism of the Ionescu-Chopard model, even though this de Solla Price article was published almost 50 years before the Ionescu and Chopard paper. It turns out that their approach predicts the h-index from bibliometric data quite well. However, its authors did not provide an analytic form of a solution to their model, relying only on Monte Carlo simulations instead. In the current work we present an exact solution to that model as well as its simplification and an application to real-world data.

The paper is organized as follows. In Sec. 2 the agent-based model proposed by Ionescu and Chopard (referred to as the IC model) is described in detail. Sec. 3 presents theoretical results concerning exact formulas for vectors of citations and the results of comparative simulation studies. In Sec. 4 a simplified model is proposed. Next, in Sec. 5 the results of an empirical analysis concerning all investigated approximations of the h-index are presented. Finally, Sec. 6 concludes the paper.


2 The IC single-scientist model 

In 2013 Ionescu and Chopard [17] introduced a multi-agent model to describe a publication-citation generation process in an abstract scientific community. Their approach consists of a scientist producing new papers, giving citations to his/her own and other already published papers, and receiving citations from the community. The model is based on a preferential attachment rule [1], which has been observed in many real-world systems [?]. As mentioned before, the preferential attachment rule is strongly connected with the so-called Matthew effect [?]: highly cited articles are more eagerly cited by other authors than lowly cited ones. More precisely, the probability of adding a new citation to a paper is proportional to the number of citations it has already obtained (shifted by one in Eq. (1), so that an uncited paper can still be cited).


2.1 Simulation description 


The simulation of interest is an iterative process. We start with an initial number of papers $N_0$, none of which is cited. During each iteration we add a new paper to the collection and distribute both self and external citations to the existing papers according to the preferential attachment rule. In each iteration a fixed number of $p$ internal (self) and $q$ external citations is given, and each of them goes to the $k$-th paper with probability:

\[ p_k = \frac{X_k + 1}{\sum_{i=1}^{n} X_i + n}, \quad k = 1, \dots, n, \tag{1} \]

where $X_i$ is the current number of external citations of the $i$-th paper and $n$ is the current number of papers. Due to the form of this probability distribution, it is assumed in [17] that only external citations are taken into account when assigning new ones; self citations do not influence a paper's importance. Once the fixed number $N$ of published papers is reached, the process goes on, but only $q$ external citations are granted during each step. The simulation ends as soon as the total number of citations $M$ has been distributed.
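The process described above can be sketched as a short Monte Carlo routine. This is a minimal illustration under stated assumptions, not the original implementation of [17]: the function name `simulate_ic` is hypothetical, p and q are assumed to be integers, the stopping rule uses the step counts $\lfloor M/(p+q) \rfloor$ and $\lfloor (M - (N - N_0)(p+q))/q \rfloor$ from the phase description in this section, and within one iteration all citations are drawn independently with the probabilities of Eq. (1) evaluated at the start of that iteration.

```python
import random


def simulate_ic(N, M, p, q, N0, seed=None):
    """One run of an IC-style publication-citation process (illustrative sketch).

    Returns (X, Y): external and self citation counts per paper.
    Assumes integer p and q; leftover citations are not distributed.
    """
    rng = random.Random(seed)
    X = [0] * N0          # external citations of the initial papers
    Y = [0] * N0          # self citations of the initial papers
    if N0 >= N:
        return X, Y       # the simulation ends before phase (I)

    # Phase (I): publish one paper per step, give q external and p self citations.
    last_step = min(N, M // (p + q) + N0)
    for _ in range(N0 + 1, last_step + 1):
        X.append(0)
        Y.append(0)
        weights = [x + 1 for x in X]          # Eq. (1), up to normalisation
        papers = range(len(X))
        for k in rng.choices(papers, weights=weights, k=q):
            X[k] += 1
        for k in rng.choices(papers, weights=weights, k=p):
            Y[k] += 1

    # Phase (II): all N papers exist, only external citations are given.
    if M // (p + q) + N0 > N:
        steps = (M - (N - N0) * (p + q)) // q
        for _ in range(steps):
            weights = [x + 1 for x in X]
            for k in rng.choices(range(N), weights=weights, k=q):
                X[k] += 1
    return X, Y


if __name__ == "__main__":
    X, Y = simulate_ic(N=205, M=13480, p=1, q=2, N0=3, seed=1)
    print(len(X), sum(X) + sum(Y))
```

Because the weights are frozen at the start of each iteration, the number of citations a paper receives in one step is binomially distributed, which is the property the analytic treatment relies on.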


Simulation steps in the IC model. Let us now formalize the aforementioned procedure. Such a detailed introduction is crucial for solving the model: the simulation may end at different stages depending on the parameter values. The IC model is based on the following input parameters:

(a) the number of papers $N \in \mathbb{N}$,

(b) the total number of citations $M \in \mathbb{N}$,

(c) the number of self citations added in each step $p \in \mathbb{N}$,

(d) the number of external citations added in each step $q \in \mathbb{N}$, and

(e) the initial number of papers with no citations at the beginning $N_0 \in \mathbb{N}$.

The initial values for sequences X and Y are given by $X_1^{(0)} = \dots = X_{N_0}^{(0)} = 0$ and $Y_1^{(0)} = \dots = Y_{N_0}^{(0)} = 0$. Values $X_k^{(t)}$ and $Y_k^{(t)}$ denote the number of external and self citations, respectively, of the $k$-th paper in the $t$-th iteration. Before the $k$-th paper is published, its citation counts are set to 0. Thus, $X_k^{(t)} = Y_k^{(t)} = 0$ for $k > t$. Nevertheless, please note that this assumption has no impact on further derivations, as it is well known that papers with no citations do not influence the h-index value.

Unlike in the case of various well-known models for constructing citation networks [?], the IC model focuses not on the overall structure of a citation network but only on the node degree distribution, i.e., on the number of citations of papers written by one author. Its aim is to approximate citation scores for each published paper of a given author, i.e., an $N$-dimensional vector $S = (S_1, \dots, S_N)$, where $S_k$ denotes the number of citations of the $k$-th paper. By definition, this shall be based solely on the number $N$ of papers he/she published as well as the total number $M$ of citations that his/her papers obtained. Moreover, we assume that citations to each paper $S_k$ are of two kinds: external $X_k$ and internal (self) ones $Y_k$, thus $S_k = X_k + Y_k$.

The simulation consists of the three following phases.

Phase 0. Firstly, we initialize the variables $X_1, \dots, X_{N_0}$ and $Y_1, \dots, Y_{N_0}$, and set $t = N_0$. In the first step of the next phase we are going to distribute citations across the first $N_0$ articles. In other words, the considered author has already published her/his first $N_0$ articles and is waiting for citations. Two cases are possible:

- $N_0 \geq N$ → the author published no more than $N_0$ papers. In such a case, the simulation ends before going to phase (I), even though it is possible that there are still citations left to be distributed. We could try going straight to phase (II) and distribute these citations, yet it would not increase the precision of the h-index estimation significantly (due to the fact that $N_0$ is small). On the other hand, this would unnecessarily complicate the formulas for $X_k$ and $Y_k$.

- $N_0 < N$ → there are enough papers and citations to go to phase (I).

Phase (I). For each $t = N_0 + 1, \dots, \min\{N, \lfloor M/(p+q) \rfloor + N_0\}$, we distribute $q$ external and $p$ self citations according to the preferential attachment rule given by Eq. (1):

\[ X_k^{(t)} = X_k^{(t-1)} + \sum_{j=0}^{q} j \cdot P\left(X_k^{(t)} = X_k^{(t-1)} + j\right), \tag{2} \]

\[ Y_k^{(t)} = Y_k^{(t-1)} + \sum_{j=0}^{p} j \cdot P\left(Y_k^{(t)} = Y_k^{(t-1)} + j\right). \tag{3} \]

When phase (I) comes to an end (which means that the author has already published all her/his works and obtained all self citations), the three following cases are possible:

- $\lfloor M/(p+q) \rfloor + N_0 = N$ → the simulation ends with no citations left to distribute,

- $\lfloor M/(p+q) \rfloor + N_0 < N$ → the simulation ends, even if there are possibly up to $p + q - 1$ undistributed citations left. In this case we could distribute such leftover citations, yet it would not increase the precision of the h-index estimation significantly and would unnecessarily complicate the formulas for $X_k$ and $Y_k$,

- $\lfloor M/(p+q) \rfloor + N_0 > N$ → the simulation does not end, there are still citations to be distributed. We go to phase (II).

Phase (II). For each $t = N + 1, \dots, \lfloor (M - (N - N_0)(p+q))/q \rfloor + N$, we distribute only the external citations among the already published $N$ papers:

\[ X_k^{(t)} = X_k^{(t-1)} + \sum_{j=0}^{q} j \cdot P\left(X_k^{(t)} = X_k^{(t-1)} + j\right), \tag{4} \]

\[ Y_k^{(t)} = Y_k^{(t-1)}. \]

When phase (II) comes to an end, two situations are possible:

- $(M - (N - N_0)(p+q)) \bmod q = 0$ → the simulation ends, no citations left to distribute,

- $(M - (N - N_0)(p+q)) \bmod q \neq 0$ → the simulation ends, even though there are possibly up to $q - 1$ undistributed citations left. The reason to abandon the distribution of the leftover citations is the same as in phase (I).


3 Exact formulas for citation vectors

Let us now present the exact formulas for $X_k^{(t)}$ and $Y_k^{(t)}$ derived for the IC model.


3.1 External citations

Please notice that the sums in Eqs. (2) and (4) are in fact the expected values of random variables from the binomial distributions $\mathrm{Bin}(q, p_{k,t})$ and $\mathrm{Bin}(q, \tilde{p}_{k,t})$, respectively, where the probabilities $p_{k,t}$, $\tilde{p}_{k,t}$ are given by:

\[ p_{k,t} = P\left(X_k^{(t)} = X_k^{(t-1)} + 1\right) = \begin{cases} \dfrac{X_k^{(t-1)} + 1}{\sum_{i=1}^{t} X_i^{(t-1)} + t}, & k \leq t, \\ 0, & k > t, \end{cases} \tag{5} \]

and

\[ \tilde{p}_{k,t} = P\left(X_k^{(t)} = X_k^{(t-1)} + 1\right) = \frac{X_k^{(t-1)} + 1}{\sum_{i=1}^{N} X_i^{(t-1)} + N}, \quad \text{for } k \leq N,\ t > N. \]
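The value of $X_k$ in the $t$-th step can be written as:

\[ X_k^{(t)} = \begin{cases} X_k^{(t-1)} + q\,p_{k,t}, & t \leq N, \\ X_k^{(t-1)} + q\,\tilde{p}_{k,t}, & t > N, \end{cases} = \begin{cases} X_k^{(t-1)} + q\,\dfrac{X_k^{(t-1)} + 1}{\sum_{i=1}^{t} X_i^{(t-1)} + t}, & t \leq N, \\ X_k^{(t-1)} + q\,\dfrac{X_k^{(t-1)} + 1}{\sum_{i=1}^{N} X_i^{(t-1)} + N}, & t > N. \end{cases} \tag{6} \]

The sums in the denominators are equal to

\[ \sum_{i=1}^{\min\{t,N\}} X_i^{(t-1)} = q(t - 1 - N_0). \]

Therefore

\[ X_k^{(t)} + 1 = \begin{cases} \left(X_k^{(t-1)} + 1\right)\left(1 + \dfrac{q}{t(q+1) - q(N_0+1)}\right), & t \leq N, \\ \left(X_k^{(t-1)} + 1\right)\left(1 + \dfrac{q}{tq + N - q(N_0+1)}\right), & t > N, \end{cases} \]

and now this recurrence relation can be solved easily. We wish to find $X_k = X_k^{(t_{\max})}$, where the value of $t_{\max}$ depends on whether the simulation stops in phase (I) or (II). When solving the recurrence equations we continue until reaching $X_k^{(k-1)} = 0$ or $X_k^{(N_0)} = 0$. As a consequence, if $t_{\max} \leq N$, then we obtain:

\[ X_k = \prod_{l=t_{\min}}^{t_{\max}} \left(1 + \frac{q}{l(q+1) - q(N_0+1)}\right) - 1, \]

and if $t_{\max} > N$, then it holds:

\[ X_k = \prod_{l=t_{\min}}^{N} \left(1 + \frac{q}{l(q+1) - q(N_0+1)}\right) \times \prod_{l=N+1}^{t_{\max}} \left(1 + \frac{q}{lq + N - q(N_0+1)}\right) - 1, \]

where

\[ t_{\min} = \max\{N_0 + 1, k\}, \qquad t_{\max} = \begin{cases} \left\lfloor \dfrac{M}{p+q} \right\rfloor + N_0, & \text{if } \left\lfloor \dfrac{M}{p+q} \right\rfloor + N_0 \leq N, \\ \left\lfloor \dfrac{M - (N - N_0)(p+q)}{q} \right\rfloor + N, & \text{if } \left\lfloor \dfrac{M}{p+q} \right\rfloor + N_0 > N. \end{cases} \]

Please note that we can simplify the above formula by relying on the notion of the $\Gamma$ function. Firstly, observe that:

\[ \prod \left(1 + \frac{q}{l(q+1) - q(N_0+1)}\right) = \frac{\prod (l - \alpha_1)}{\prod (l - \beta_1)}, \qquad \prod \left(1 + \frac{q}{lq + N - q(N_0+1)}\right) = \frac{\prod (l - \alpha_2)}{\prod (l - \beta_2)}, \]

where:

\[ \alpha_1 = \frac{qN_0}{q+1}, \quad \beta_1 = \frac{q(N_0+1)}{q+1}, \quad \alpha_2 = \frac{qN_0 - N}{q}, \quad \beta_2 = \alpha_2 + 1. \]

Moreover,

\[ l - a = \frac{\Gamma(l - a + 1)}{\Gamma(l - a)} \quad \text{and} \quad \prod_{l=t_1}^{t_2} (l - a) = \frac{\Gamma(t_2 - a + 1)}{\Gamma(t_1 - a)}. \]

Since $\beta_2 = \alpha_2 + 1$, the second product telescopes to $(t_{\max} - \alpha_2)/(N - \alpha_2)$. Hence, for $t_{\max} > N$ the formula for $X_k$ can be written as:

\[ X_k = \frac{t_{\max} - \alpha_2}{N - \alpha_2} \cdot \frac{\Gamma(N - \alpha_1 + 1)\,\Gamma(t_{\min} - \beta_1)}{\Gamma(t_{\min} - \alpha_1)\,\Gamma(N - \beta_1 + 1)} - 1, \tag{7} \]

with $t_{\min}$, $t_{\max}$, $\alpha_1$, $\beta_1$ and $\alpha_2$ as defined above.

The above simplification gives a more elegant representation of X. However, it is worth noting that the product form is more computationally stable than calculating gamma functions for large arguments. For this reason we use the product form in our simulations. Nevertheless, both representations enable us to compute the elements of X significantly faster than in the case of the simulation procedure presented in [17].


3.2 Self citations

Similarly as in the previous subsection, we can solve the equation for the distribution of self citations. Based on Eqs. (3) and (5), and writing $t_{\min} = \max\{N_0 + 1, k\}$ as before, we have for $t \geq t_{\min}$:

\[ Y_k^{(t)} = Y_k^{(t-1)} + p\,p_{k,t} = Y_k^{(t-1)} + \frac{p\left(X_k^{(t-1)} + 1\right)}{q(t - 1 - N_0) + t} = Y_k^{(t-1)} + \frac{p}{q(t - 1 - N_0) + t} \prod_{l=t_{\min}}^{t-1} \left(1 + \frac{q}{q(l - 1 - N_0) + l}\right), \]

so that

\[ Y_k^{(t)} = \sum_{i=t_{\min}+1}^{t} \frac{p}{q(i - 1 - N_0) + i} \prod_{l=t_{\min}}^{i-1} \left(1 + \frac{q}{q(l - 1 - N_0) + l}\right) + \frac{p}{q(t_{\min} - 1 - N_0) + t_{\min}}. \]

We would like to find $Y_k = Y_k^{(s_{\max})}$ and, due to the fact that here we always end up in phase (I), $s_{\max}$ is equal to:

\[ s_{\max} = \min\left\{N, \left\lfloor \frac{M}{p+q} \right\rfloor + N_0\right\}. \]

Hence,

\[ Y_k = \sum_{i=t_{\min}+1}^{s_{\max}} \frac{p}{q(i - 1 - N_0) + i} \prod_{l=t_{\min}}^{i-1} \left(1 + \frac{q}{q(l - 1 - N_0) + l}\right) + \frac{p}{q(t_{\min} - 1 - N_0) + t_{\min}}. \]

Also, the formula for $Y_k$ may be simplified as follows:

\[ Y_k = \sum_{i=t_{\min}+1}^{s_{\max}} \frac{p}{q(i - 1 - N_0) + i} \cdot \frac{\Gamma(i - \alpha)\,\Gamma(t_{\min} - \beta)}{\Gamma(t_{\min} - \alpha)\,\Gamma(i - \beta)} + \frac{p}{q(t_{\min} - 1 - N_0) + t_{\min}}, \tag{8} \]

where

\[ \alpha = \frac{qN_0}{q+1}, \qquad \beta = \frac{q(N_0+1)}{q+1}. \]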
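The closed forms of Secs. 3.1 and 3.2 can be checked numerically with a short sketch. All function names below are hypothetical, integer p and q with $N_0 < N$ are assumed, and the gamma-based variant is only applied in the case $t_{\max} > N$ covered by Eq. (7). The check relies on the fact that each step adds exactly q expected external citations (and, in phase (I), p expected self citations), so the vectors must sum to $q(t_{\max} - N_0)$ and $p(s_{\max} - N_0)$.

```python
import math


def stopping_times(N, M, p, q, N0):
    """Return (t_max, s_max): the last step for external and for self citations."""
    phase1_end = M // (p + q) + N0
    if phase1_end <= N:
        t_max = phase1_end
    else:
        t_max = (M - (N - N0) * (p + q)) // q + N
    return t_max, min(N, phase1_end)


def x_product(k, N, M, p, q, N0):
    """External citations of paper k via the product form of Sec. 3.1."""
    t_min = max(N0 + 1, k)
    t_max, _ = stopping_times(N, M, p, q, N0)
    prod = 1.0
    for l in range(t_min, min(N, t_max) + 1):        # phase (I) factors
        prod *= 1 + q / (l * (q + 1) - q * (N0 + 1))
    for l in range(N + 1, t_max + 1):                # phase (II) factors
        prod *= 1 + q / (l * q + N - q * (N0 + 1))
    return prod - 1


def x_gamma(k, N, M, p, q, N0):
    """The same quantity via Eq. (7); assumes t_max > N and k <= N."""
    t_min = max(N0 + 1, k)
    t_max, _ = stopping_times(N, M, p, q, N0)
    a1 = q * N0 / (q + 1)
    b1 = q * (N0 + 1) / (q + 1)
    a2 = (q * N0 - N) / q
    # Work with log-gamma to avoid overflow for large arguments.
    log_ratio = (math.lgamma(N - a1 + 1) + math.lgamma(t_min - b1)
                 - math.lgamma(t_min - a1) - math.lgamma(N - b1 + 1))
    return (t_max - a2) / (N - a2) * math.exp(log_ratio) - 1


def y_sum(k, N, M, p, q, N0):
    """Self citations of paper k via the sum form preceding Eq. (8)."""
    t_min = max(N0 + 1, k)
    _, s_max = stopping_times(N, M, p, q, N0)
    if t_min > s_max:
        return 0.0                                   # paper k was never published

    def denom(i):
        return q * (i - 1 - N0) + i

    total = p / denom(t_min)
    growth = 1.0      # running product of (1 + q / denom(l)) for l = t_min..i-1
    for i in range(t_min + 1, s_max + 1):
        growth *= 1 + q / denom(i - 1)
        total += p / denom(i) * growth
    return total


if __name__ == "__main__":
    args = (205, 13480, 1, 2, 3)                     # N, M, p, q, N0
    X = [x_product(k, *args) for k in range(1, 206)]
    Y = [y_sum(k, *args) for k in range(1, 206)]
    print(sum(X), sum(Y))
```

The log-gamma route sidesteps overflow, but it does not remove the cancellation between large terms, which is consistent with the remark above that the product form is the more stable choice.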


3.3 Non-integer values of p and q 

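The authors of the IC model mention in [17] that, given non-integer values of $p$ and $q$, one distributes:

\[ p' = \begin{cases} \lceil p \rceil & \text{with probability } 1 - (\lceil p \rceil - p), \\ \lceil p \rceil - 1 & \text{with probability } \lceil p \rceil - p, \end{cases} \]

self citations as well as:

\[ q' = \begin{cases} \lceil q \rceil & \text{with probability } 1 - (\lceil q \rceil - q), \\ \lceil q \rceil - 1 & \text{with probability } \lceil q \rceil - q, \end{cases} \]

citations given by the scientific community. Therefore, the average number of self and external citations is equal to $p$ and $q$, respectively.

Let us consider $X_k^{(t)}$ and let the second summand in Eq. (6) be denoted as:

\[ E\left(Q_k^{t}\right) = \begin{cases} q\,p_{k,t}, & t \leq N, \\ q\,\tilde{p}_{k,t}, & t > N, \end{cases} \]

where the number of external citations to the $k$-th paper obtained at time $t$, $Q_k^{t}$, follows the $\mathrm{Bin}(q, p_{k,t})$ distribution for $t \leq N$ and the $\mathrm{Bin}(q, \tilde{p}_{k,t})$ distribution for $t > N$. Moreover, for non-integer $q$ (and $t \leq N$; the case $t > N$ is analogous with $\tilde{p}_{k,t}$),

\[ E\left(Q_k^{t} \mid q'\right) = \begin{cases} \lceil q \rceil\,p_{k,t} & \text{with probability } 1 - (\lceil q \rceil - q), \\ (\lceil q \rceil - 1)\,p_{k,t} & \text{with probability } \lceil q \rceil - q. \end{cases} \]

Taking into account the distribution of $q'$, we have that:

\[ E\left(Q_k^{t}\right) = E\left(Q_k^{t} \mid q' = \lceil q \rceil\right) P\left(q' = \lceil q \rceil\right) + E\left(Q_k^{t} \mid q' = \lceil q \rceil - 1\right) P\left(q' = \lceil q \rceil - 1\right). \]

Therefore, Eq. (6) is now of the form:

\[ X_k^{(t)} = X_k^{(t-1)} + \lceil q \rceil\,p_{k,t}\,(1 + q - \lceil q \rceil) + (\lceil q \rceil - 1)\,p_{k,t}\,(\lceil q \rceil - q) = X_k^{(t-1)} + p_{k,t}\left(\lceil q \rceil (1 + q - \lceil q \rceil) + (\lceil q \rceil - 1)(\lceil q \rceil - q)\right) = X_k^{(t-1)} + q\,p_{k,t}. \]

By relying on a similar reasoning we obtain:

\[ Y_k^{(t)} = Y_k^{(t-1)} + \lceil p \rceil\,p_{k,t}\,(1 + p - \lceil p \rceil) + (\lceil p \rceil - 1)\,p_{k,t}\,(\lceil p \rceil - p) = Y_k^{(t-1)} + p\,p_{k,t}. \]

It is easily seen that the final form of the result is the same as in the formulas for integer $p$ and $q$.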
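The randomized rounding rule can be illustrated with a short sketch; the helper names `draw_count` and `expected_count` are hypothetical. The first function samples the number of citations to distribute in one step, the second evaluates its mean directly from the two-point distribution, which equals the non-integer parameter itself.

```python
import math
import random


def draw_count(x, rng):
    """Sample ceil(x) with probability 1 - (ceil(x) - x), otherwise ceil(x) - 1."""
    c = math.ceil(x)
    return c if rng.random() < 1 - (c - x) else c - 1


def expected_count(x):
    """Mean of the two-point distribution used by draw_count."""
    c = math.ceil(x)
    return c * (1 - (c - x)) + (c - 1) * (c - x)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [draw_count(2.5, rng) for _ in range(10000)]
    # The empirical mean should be close to the exact mean 2.5.
    print(sum(sample) / len(sample), expected_count(2.5))
```

For an integer argument the rule degenerates to always returning that integer, so the integer case of the model is recovered.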


3.4 Overall number of citations and the h-index 

Once we have determined the formulas for the external $X_k$ and self $Y_k$ citations corresponding to the Ionescu and Chopard model, the only action left to estimate the h-index is to sum them up. One sees that both $X_k$ and $Y_k$ are nonincreasing in $k$ (a later paper takes part in fewer citation rounds), so $S_k = X_k + Y_k$ is also nonincreasing and thus:

\[ h_{exact} = \max\{k : S_k \geq k\}. \]
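As a cross-check of the whole chain, one can also iterate the expected-value recurrences of Eqs. (2)-(4) step by step and read the h-index off the resulting vector. The sketch below does exactly that; it is not the closed-form route of Eqs. (7) and (8), the function names are hypothetical, and integer p, q with $N_0 < N$ are assumed.

```python
def expected_citations(N, M, p, q, N0):
    """Iterate the expected-value recurrences; return (X, Y) as lists of floats."""
    X = [0.0] * N0
    Y = [0.0] * N0
    if N0 >= N:
        return X, Y                       # the simulation ends before phase (I)

    # Phase (I): a new paper appears, q external and p self citations are given.
    phase1_end = min(N, M // (p + q) + N0)
    for t in range(N0 + 1, phase1_end + 1):
        X.append(0.0)
        Y.append(0.0)
        denom = sum(X) + t
        probs = [(x + 1) / denom for x in X]          # Eq. (5)
        X = [x + q * pk for x, pk in zip(X, probs)]
        Y = [y + p * pk for y, pk in zip(Y, probs)]

    # Phase (II): only external citations, among the N published papers.
    if M // (p + q) + N0 > N:
        t_max = (M - (N - N0) * (p + q)) // q + N
        for _ in range(N + 1, t_max + 1):
            denom = sum(X) + N
            X = [x + q * (x + 1) / denom for x in X]
    return X, Y


def h_exact(N, M, p, q, N0):
    """Largest k with S_k >= k, where S_k = X_k + Y_k is nonincreasing in k."""
    X, Y = expected_citations(N, M, p, q, N0)
    S = [x + y for x, y in zip(X, Y)]
    return max((k for k, s in enumerate(S, start=1) if s >= k), default=0)


if __name__ == "__main__":
    # Parameters of the Hirsch example from Sec. 3.5 with N0 = p + q = 3.
    print(h_exact(N=205, M=13480, p=1, q=2, N0=3))
```

Running the recurrences costs on the order of $t_{\max} \cdot N$ operations, whereas the product form needs only the factors between $t_{\min}$ and $t_{\max}$ for each paper.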


3.5 Comparative simulation study 

Let us now briefly compare the estimates of the h-index obtained with the IC model (denoted as $h_{IC}$) and $h_{exact}$, i.e., the ones that are based on Eq. (7) and Eq. (8). We consider the vector of citations of J.E. Hirsch himself. The data were gathered on July 30, 2015 from the Scopus database. The vector consists of a total of $M = 13480$ citations and $N = 205$ publications. The h-index of Hirsch is equal to 52.

According to [17], the parameters $p$ and $q$ giving the best global agreement between the IC model and the original h-index are equal to 1 and 2, respectively. However, the authors also stated that $p$ and $q$ can be tuned in such a way that almost any scientific profile can be fitted well. In the case of the h-index of Hirsch, we found that $p = 1$ and $q = 3$ give a reasonable agreement, while for $q = 2$ the final h-index is overestimated. Please note that the model is stochastic in nature and its results vary across different simulation runs, even for the same values of $p$, $q$, $M$ and $N$. Therefore, for the purpose of a sensible comparison, we analyzed 1000 samples for $p = 1$, $q = 1, 2, \dots, 10$ and $N_0 = p + q$. The $h_{IC}$ distribution estimates are presented in Fig. 1 in the form of a box-and-whisker plot. Additionally, $h_{exact}$ and the h-index obtained by averaging the citation vectors generated by the IC model are indicated. We may observe a high agreement between $h_{IC}$ computed on an averaged citation vector and $h_{exact}$ (the largest difference between these two estimates, i.e., $|h_{IC} - h_{exact}|$, is equal to 1).


Fig. 1: Boxplots for the distribution of the h-index of Hirsch as estimated via the IC model (horizontal axis: $q$). Additionally, the h-index computed according to citation vectors obtained via Eqs. (7) and (8) is marked with △ and the h-index obtained from averaged citation vectors from the IC model by ×.


^ The box-and-whisker plot aims to graphically represent the empirical distribution of a given sample. The box ranges from the first ($Q_1$) to the third ($Q_3$) quartile and the bold line gives the median. The whiskers range from $\max\{\mathrm{Min}, Q_1 - 1.5(Q_3 - Q_1)\}$ to $\min\{\mathrm{Max}, Q_3 + 1.5(Q_3 - Q_1)\}$. Moreover, each (∘) marks an outlier, that is, an observation less than $Q_1 - 1.5(Q_3 - Q_1)$ or greater than $Q_3 + 1.5(Q_3 - Q_1)$.
Fig. 2: Step plots of the vectors of external citations X (panel (a)) and self citations Y (panel (b)) obtained from the IC model, depicted by ×, and from Eqs. (7) and (8), depicted by ∘ (horizontal axis: Paper Index).

As stated in [17], the initial number of publications $N_0$ should be small enough so as not to influence the rest of the process, but large enough to provide enough papers to cite in the first iteration. The authors suggest choosing $N_0 = p + q$. In order to assess the influence of this parameter on $h_{IC}$ and $h_{exact}$, we analyze $N_0$ varying from 1 to 50, for $p = 1$ and $q = 2$. Please note that $N_0 = 50$ is nearly 25% of Hirsch's publication count. For the IC model, we perform 1000 runs and average the obtained values: AVR($h_{IC}$) denotes the mean of $h_{IC}$ obtained in each run and sd its standard deviation, while $h_{IC}$(AVR) denotes the h-index computed on an averaged citation vector from the IC model. The results presented in Table 1 suggest that there is no significant difference between $N_0 = 1, 2$ and $N_0 = p + q = 3$ in this case. Therefore, one may choose $N_0 = 1$ and, if $N = 1$ and $M > 0$, simply assign the h-index equal to 1.


Fig. 2 presents the step plots of the vectors of external citations X (a) and self citations Y (b) obtained from the averaged (over 1000 runs) IC model as well as from Eqs. (7) and (8). The sums of squared differences between the simulated and analytical results are equal to 7.22 and 0.002, respectively. The real value of the h-index of Hirsch is equal to 52 and the estimated values (for parameters $p = 1$ and $q = 2$) are equal to $h_{IC}$(AVR) = 54 and $h_{exact}$ = 54.


Table 1: Aggregated results of 1000 runs of the IC model for $p = 1$, $q = 2$ and $N_0 \in \{1, \dots, 10, 15, 20, 25, 50\}$, where AVR($h_{IC}$) denotes the mean $h_{IC}$, sd its standard deviation, $h_{IC}$(AVR) the h-index computed on an averaged citation vector from the IC model, and $h_{exact}$ the h-index obtained via the analytical formulas.

N_0    AVR(h_IC)    sd      AVR(h_IC) ± sd     h_IC(AVR)    h_exact
1      56.19        2.47    (53.71; 58.66)     54           54
2      56.52        2.35    (54.17; 58.87)     54           54
3      56.73        2.36    (54.38; 59.09)     54           54
4      57.04        2.38    (54.66; 59.42)     55           55
5      57.34        2.30    (55.04; 59.64)     55           55
6      57.51        2.30    (55.21; 59.81)     55           55
7      57.81        2.37    (55.44; 60.18)     56           56
8      58.21        2.27    (55.94; 60.48)     57           56
9      58.43        2.34    (56.08; 60.77)     56           56
10     58.63        2.43    (56.20; 61.07)     57           56
15     59.70        2.33    (57.36; 62.03)     58           58
20     60.85        2.29    (58.56; 63.14)     60           60
25     61.79        2.30    (59.49; 64.09)     61           62
50     65.23        2.30    (62.93; 67.53)     71           71


4 A simplification of the IC model 


Please note that the exact solution to the IC model, i.e., Eqs. (7) and (8), gives an analytical expression of the very intuitive and reasonable simulation setup proposed by Ionescu and Chopard. Nevertheless, as the form of the derived formulas is quite complicated, their intuitive interpretation is difficult.

Let us recall that the IC model is based on the assumption that only external citations are taken into account when assigning new ones (due to the form of the probability distribution given by Eq. (1)). Moreover, during the simulation study we observed, as was also stated in [17], that the parameter $p$ has no significant influence on the resulting h-index. Hence, in this section we reduce the number of parameters, which leads to a significant simplification of the model.

Let us employ the following assumptions:

(i) We assume $N_0 = 0$, so the first paper starts to gain citations just after its publication.

(ii) We consider only one vector X, which means that we take into account all the citations together without distinguishing between external and self citations.

The number of simulation parameters is decreased to only two: $q$, which is the number of citations given in each iteration, and $T$, which is the number of simulation steps. Similarly as in Sec. 3, let us write the recurrence relation for $X_k^{(t)}$:

\[ X_k^{(t)} = \begin{cases} X_k^{(t-1)} + \dfrac{q\left(X_k^{(t-1)} + 1\right)}{\sum_{i=1}^{t} X_i^{(t-1)} + t}, & k = 1, \dots, t, \\ 0, & k = t + 1, \dots, T. \end{cases} \]

Since the sum in the denominator equals $q(t-1)$, this may be expressed as:

\[ X_k = X_k(T) = \prod_{l=k}^{T} \frac{l}{l - q/(q+1)} - 1, \tag{9} \]

which may be further simplified as:

\[ X_k = \frac{\Gamma(T+1)}{\Gamma(T + 1 - q/(q+1))} \cdot \frac{\Gamma(k - q/(q+1))}{\Gamma(k)} - 1. \tag{10} \]

Eq. (10) is the exact solution of our simplified version of the model, but the following asymptotic relation:

\[ \lim_{t \to \infty} \frac{\Gamma(t + a)}{\Gamma(t)\,t^{a}} = 1, \]

allows us to obtain the approximation of $X_k$ as:

\[ X_k \approx \frac{\Gamma(T+1)}{\Gamma(T+1)\,(T+1)^{-\alpha}} \cdot \frac{\Gamma(k)\,k^{-\alpha}}{\Gamma(k)} - 1 = \left(\frac{T+1}{k}\right)^{\alpha} - 1, \]

where $\alpha = q/(q+1)$.

Even with our exact solution, finding a compact formula for the h-index seems intractable. Fortunately, for the simplification of the IC model, the observation that $X_k$ is a decreasing function of $k$ leads to:

\[ h = \left(\frac{T+1}{h}\right)^{\alpha} - 1, \]

which is equivalent to:

\[ (h + 1)\,h^{\alpha} = (T+1)^{\alpha}. \tag{11} \]

One can show that for every $T > 0$ and $\alpha \in (0, 1)$ Eq. (11) always has exactly one solution, which is the h-index.


5 Real data evaluation

In this section we perform an empirical analysis of exemplary citation vectors gathered from Elsevier's Scopus (see [?] for a detailed description of the data set). Please note that the whole data set includes citation vectors corresponding to 16282 authors. Nevertheless, about 78% of all the vectors are of length one (among them ca. 32% represent a single uncited paper). This is typical of bibliometric data sets, which consist of a high number of short vectors. Moreover, since it is observed that all the vectors are skewed, distributions such as the exponential or Pareto type II (Lomax) are usually used to model them (e.g., compare [?]). Table 2 presents basic sample statistics for the Scopus data set.

Table 2: Basic sample statistics (the Scopus data set) of the number of published papers by an author (N), the total number of citations he/she received (M), and the number of citations to his/her most (max) and least (min) frequently cited paper.

           N        M         max      min
Min.       1        0         0        0
1st Qu.    1        0         0        0
Median     1        3         3        1
Mean       1.67     13.53     9.10     5.72
3rd Qu.    1        11        9        5
Max.       129      2396      836      836

For the sake of clarity of the results presented in this paper, a subset of 100 authors has been selected. In order to assess the quality of the proposed approximation we chose all vectors of length greater than or equal to 20 (69 in total), and from the vectors of length smaller than 20 we drew 31 uniformly at random. Basic sample statistics of the selected sample are presented in Table 3.

Table 3: Basic sample statistics of the selected sample from the Scopus data set.

           N        M         max      min
Min.       1        0         0        0
1st Qu.    6        45.75     19.50    0
Median     21.50    207.50    36       0
Mean       26.79    369.60    76.36    1.34
3rd Qu.    31.75    486.20    102      0
Max.       129      2396      636      20


Fig. 3 presents the approximated values of the h-index as a function of the real values from the considered data. Please note that the mean squared difference between the estimated values and the h-index equals 6.15, 6.45 and 4.14, respectively, for the estimates based on Eq. (?), Eq. (11) and Eq. (?).

Please note that in the case of Hirsch himself, considered in Sec. 3.5, the approximations of the h-index are equal to 51.99 ≈ 52 for the approximation given by Eq. (11) and 52 for the approximation based on Eq. (10). The obtained estimates of the Hirsch h-index for various values of the parameter $q$ are presented in Table 4. Please note that by an appropriate selection of the parameter we were able to recreate the value of his h-index. Moreover, Fig. 4 depicts its predicted growth dynamics over each iteration. We see that our simplification does not predict the h-index worse than the original simulation. However, one should be aware that the approximate h-index given by Eq. (11) is not necessarily an integer (compare Fig. 3 and Table 4). This should be taken into account in the analysis of real data sets: if needed, e.g., proper rounding can be applied.
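The simplified model of Sec. 4 is easy to evaluate numerically. The sketch below, with hypothetical function names, computes $X_k$ from the gamma form of the exact solution, its power-law approximation $((T+1)/k)^{\alpha} - 1$, and the root of $(h+1)h^{\alpha} = (T+1)^{\alpha}$ by bisection. The choice $T = 13480$ (the total number of citations of Hirsch) in the usage example is our assumption: the text only calls $T$ the number of simulation steps, but this value is consistent with the 51.99 quoted above for $q = 2.5$.

```python
import math


def x_simplified(k, T, q):
    """Exact X_k of the simplified model, gamma form, evaluated via log-gamma."""
    a = q / (q + 1)
    log_value = (math.lgamma(T + 1) - math.lgamma(T + 1 - a)
                 + math.lgamma(k - a) - math.lgamma(k))
    return math.exp(log_value) - 1


def x_approx(k, T, q):
    """Asymptotic approximation ((T + 1) / k) ** alpha - 1."""
    a = q / (q + 1)
    return ((T + 1) / k) ** a - 1


def h_approx(T, q, tol=1e-10):
    """Solve (h + 1) * h ** alpha = (T + 1) ** alpha for h by bisection."""
    a = q / (q + 1)
    target = (T + 1) ** a
    lo, hi = 0.0, float(T + 1)     # the left-hand side is increasing in h
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (mid + 1) * mid ** a < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    T = 13480                      # assumption: T set to Hirsch's citation total
    for q in (2.5, 3.0):
        print(q, round(h_approx(T, q), 2))
```

Bisection is used only because the left-hand side is monotone in $h$, which makes the bracket $[0, T+1]$ safe; any standard root finder would serve equally well.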


Fig. 3: Comparison of the h-index and its approximations on a Scopus data set. The black continuous line is the identity, so ideally all the points should lie on this line. The points depicted with ∘ correspond to the h-index estimated from the exact solution of the IC simulation with parameters $N_0 = 1$, $p = 1$, $q' = 2$, where only the vector of external citations was taken into account, i.e., one that is based on Eq. (?); the points marked with + correspond to the estimate of the h-index that is based only on a vector of external citations (Eq. (?)) with parameter $q'' = 3$, while the points marked with △ correspond to the approximation given by Eq. (?), also with $q'' = 3$. The dotted lines of corresponding color are the least squares fits of the h-index values and the considered approximations, respectively.

Table 4: The approximations of the Hirsch h-index calculated via Eq. (11) and based on Eq. (10) for various parameters $q$.

q      Eq. (11)    Rounded values of Eq. (11)    Eq. (10)
1      23.14       23                            23
1.5    34.75       35                            35
2      44.27       44                            44
2.5    51.99       52                            52
3      58.30       58                            58
3.5    63.52       64                            63
4      67.91       68                            68
4.5    71.62       72                            72
5      74.82       75                            75

Fig. 4: The estimated values of the h-index of Hirsch himself based on Eq. (?) in each time point $t$ ($q = 2.5$). The horizontal line depicts the real value of his h-index (equal to 52), while the vertical line depicts the current time point.


6 Conclusions

In this paper we investigated an agent-based model for the bibliometric h-index introduced in [17]. The main contribution is an exact formula for the number of external citations and self citations for each paper produced by a given author. This result not only completes the work conducted by Ionescu and Chopard, but also gives a perspective for a better insight into the citation process. What is more, we proposed a simplification of the IC model and presented an approximation of the h-index based on such an approach. The obtained exact formulas were compared with the results of simulations proposed by Ionescu and Chopard. Interestingly, we may observe a good level of compatibility between them, but mostly for a large number of papers and citations. In this case, however, simulations are more computationally demanding, which makes the use of the exact formulas preferable. A real data evaluation on an informetric data set was also presented.

There are still many issues worth deeper investigation. First of all, thanks to the analytical formulas one may analyze the theoretical properties of the h-index estimate. Since it has been shown that the h-index is an example of an aggregation operator and its properties can be studied by means of aggregation theory, it is worth investigating whether such properties remain valid for the IC model estimate.

Moreover, a theoretical evaluation of the influence of the considered parameters on the results, which has been done by Ionescu and Chopard only by an empirical study, should be performed. Note that the exact formula for the approximation given by Eq. (11), as well as a comparative study of the proposed approximations of the Hirsch index and the ones already available in the literature, opens an interesting future research direction. Also, it is reasonable to perform a similar analysis on different data sets, for example ones concerning social network (Facebook, Twitter) users or citation information gathered from different fields of science.

Finally, there are many variations of the classical preferential attachment rule proposed by Barabási and Albert, and there is a rich discussion in the literature on the proper version of those mechanisms for the considered problem [?]. Mostly due to its simplicity (and for agreement with the original work [17]) we chose the classical linear version [8,11]. The analysis of different forms of the preferential attachment rule (such as those presented in [12] or [?]) is also left for future studies.